Member:

Huilin Zhou (hz2507)
Zhuo Li (zl2637)
Qing Xu (qx2178)
Zhaoyu Liu (zl2638)

The github link is:https://github.com/xuqingsally/p8105-final-project

Motivation:

Many of us will find a job in recent years. In New York City government area, what type of jobs can be payed for higher salary? If that’s the ideal job for me , what skills should I have before applying for it? Also where should I rent an apartment if I want to live close to that type of positions? By looking at the NYC jobs data set from 2013 to 2017,we wish to give some advice on job catrgories, required skills and salary range for people who are seeking for a job.

Initial questions:

We are trying to analyze the NYC government jobs based on the salary, the requirement of the degree, locations, job type and also required skills to provide people with a viewable results to let them find the potential suitable job.
We did the research on the question like where to find a suitable job with the highest salary. What’s the highest base salary job in need of specific education degree?

While doing the analysis of the dataset and making the visualization plots, we started to think about making the wordcount to analyze the preferred skills in different job category to better find out the potential suitable job for different people. Besides, we made the bar chart to find out the number of positions for each job category in each year from 2013 to 2017, and we found out the number of job positions in each category keeps increasing, which give us better evidence to research on the question of finding suitable jobs.

After analysizing job information among the data set,we will consider linearity regression model to test the association between base salary and the level of the position in the analysis. Also we want to know whether these conclusions can be generalized to other cities.

Data:

Source: https://data.cityofnewyork.us/City-Government/NYC-Jobs/kpav-sd4t/data
Scrape:We downloaded the data set from the website and the original data set contains 3174 job information. We selected data with information about work location, job category, preferred skills and full/part time. We want to use Google Map to get longtitute and latitute from work locations. Since the Google map has a limit of 2500 observation one time, we selected first 2500 observations from the data set.
Cleaning:
1. Merge job category if they are the same kind but just have different names. Finally we got 12 kinds of job categories.
2. Unify salary unit to “Annual” and recalculated the salary range and average salary.
3. Use Google Map to change location to longitute and latitute.
4. Since Google map has limitation, we wrote a new data set after cleaning it and use it in the after analysize.

#clean data
df<- read.csv("NYC_Jobs.csv") %>%
  clean_names() %>%
  filter(job_category!= " ", 
         full_time_part_time_indicator!=" ", 
         preferred_skills!= " ",
         work_location!=" ") %>%
  head(2500) #google map limited to 2500
df$job_category<- as.character(df$job_category)

#combine job category
for (i in 1:2500) {
    df$job_category[[i]][str_detect(df$job_category[[i]],regex("engineering", ignore_case = TRUE))] <- "Engineering"
    df$job_category[[i]][str_detect(df$job_category[[i]],regex("finance", ignore_case = TRUE))] <- "Finance"
    df$job_category[[i]][str_detect(df$job_category[[i]],regex("legal", ignore_case = TRUE))] <- "Legal Affairs"
    df$job_category[[i]][str_detect(df$job_category[[i]],regex("technology", ignore_case = TRUE))] <- "Technology"
    df$job_category[[i]][str_detect(df$job_category[[i]],regex("policy", ignore_case = TRUE))] <- "Policy"
    df$job_category[[i]][str_detect(df$job_category[[i]],regex("communi", ignore_case = TRUE))] <- "Community"
    df$job_category[[i]][str_detect(df$job_category[[i]],regex("clerical", ignore_case = TRUE))] <- "Clerical"
    df$job_category[[i]][str_detect(df$job_category[[i]],regex("social service", ignore_case = TRUE))] <- "Social service"
    df$job_category[[i]][str_detect(df$job_category[[i]],regex("public safety", ignore_case = TRUE))] <- "Public safety"
    df$job_category[[i]][str_detect(df$job_category[[i]],regex("administration", ignore_case = TRUE))] <- "Administration"
    df$job_category[[i]][str_detect(df$job_category[[i]],regex("operations", ignore_case = TRUE))] <- "Maintance"
    df$job_category[[i]][str_detect(df$job_category[[i]],regex("maintenance ", ignore_case = TRUE))] <- "Maintance"
    
}


# change unit of salary and calculate mean
df$salary_frequency<-as.character((df$salary_frequency))
df$salary_range_from<- as.numeric(df$salary_range_from)
df$salary_range_to<- as.numeric(df$salary_range_to)
df$salary_range_from <- with(df, ifelse(salary_frequency == "Hourly", salary_range_from*1825,salary_range_from))
df$salary_range_to <- with(df, ifelse(salary_frequency == "Hourly", salary_range_to*1825,salary_range_to))
df$salary_range_from <- with(df, ifelse(salary_frequency == "Daily", salary_range_from*365*5/7,salary_range_from))
df$salary_range_to <- with(df, ifelse(salary_frequency == "Daily", salary_range_to*365*5/7,salary_range_to))

df<-mutate(df,salary_frequency="Annual") %>%
          mutate(salary_mean = (salary_range_to-salary_range_from)/2+salary_range_from)
df$salary_frequency<-as.character((df$salary_frequency))
df$salary_range_from<- as.numeric(df$salary_range_from)
df$salary_range_to<- as.numeric(df$salary_range_to)
df$salary_range_from <- with(df, ifelse(salary_frequency == "Hourly", salary_range_from*1825,salary_range_from))
df$salary_range_to <- with(df, ifelse(salary_frequency == "Hourly", salary_range_to*1825,salary_range_to))
df$salary_range_from <- with(df, ifelse(salary_frequency == "Daily", salary_range_from*365*5/7,salary_range_from))
df$salary_range_to <- with(df, ifelse(salary_frequency == "Daily", salary_range_to*365*5/7,salary_range_to))

df<-mutate(df,salary_frequency="Annual") %>%
          mutate(salary_mean = (salary_range_to-salary_range_from)/2+salary_range_from)


# get longtitude and latitude
df$work_location<-as.character(df$work_location)
geo <- geocode(location = df$work_location, output="latlon", source="google")
df<- df %>%
  mutate(lon = geo$lon, lat = geo$lat) %>%
  group_by(job_category)

df <- df %>%
  mutate(text_label = str_c("Annual mean salary$:",salary_mean))
# write a new data set and use it in the following analysize
write.csv(df,file = "job.csv",row.names=FALSE)

Exploratory analysis:

Location map

First, we use longtitue and latitute of 2500 observations and plot their location in the map by different job category and text with average annual salary.

# plotly worldwild
Sys.setenv('MAPBOX_TOKEN' = 'pk.eyJ1IjoieHVxaW5nc2FsbHkiLCJhIjoiY2phZWh0djdyMHUzZTJ3bGR3MHFsdmIzZSJ9.vhYtu7zeAAuX6slhdDj6lA')

nyc_jobs=read_csv("job.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   job_id = col_integer(),
##   x_of_positions = col_integer(),
##   salary_range_from = col_double(),
##   salary_range_to = col_double(),
##   posting_date = col_datetime(format = ""),
##   post_until = col_datetime(format = ""),
##   posting_updated = col_datetime(format = ""),
##   process_date = col_datetime(format = ""),
##   salary_mean = col_double(),
##   lon = col_double(),
##   lat = col_double()
## )
## See spec(...) for full column specifications.

plot_mapbox(nyc_jobs,lat = ~lat, lon = ~lon,
              size=2,
              split = nyc_jobs$job_category,
              mode = 'scattermapbox') %>%
  add_markers(
    text = ~text_label,
    color = ~job_category,size = I(9)) %>%
  layout(title = 'Work Location',
         font = list(color='white'),
         plot_bgcolor = '#191A1A', paper_bgcolor = '#191A1A',
         mapbox = list(style = 'dark'),
         legend = list(orientation = 'h',
                       font = list(size = 8)),
         margin = list(l = 25, r = 25,
                       b = 25, t = 25,
                       pad = 2))
## Warning: Ignoring 81 observations
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

From the map, we found that most of jobs located in New York but also there are some jobs located all over the world. In New York, most jobs located in the lower Manhattan and Brooklyn Heights where are prosperous places. Zoom in the map to see details of the distribution of jobs.

Boxplot of salary in different job category

We draw the distribution of average annual salary of different job categories.

nyc_jobs %>%
plot_ly(y = ~salary_mean, color = ~job_category, type = "box",
          colors = "Set2") %>%
layout(title = "Annual average salary distribution in 12 job categories")

From the boxplot, we found that “Technology” and “Engineering” jobs have higher average salary and “Clerical” jobs has lower average salary among the 12 job categories. It makes sense that the first two job categories need professional skills and knowledge while those skills are unnecessary for clerical work. Also there exists some outliers in several categories. If you really did well in your position, it doesn’t matter what’s your job category. It is important that you do the work that interests you, but at the same time, have a job in technology and engineering may be more well-paid.

# Job category with different educational requirement and base salary
job_data = nyc_jobs %>%
  select(job_category, salary_range_from, salary_range_to,
         minimum_qual_requirements, full_time_part_time_indicator,
         salary_frequency) %>%
  filter(job_category!= " ", minimum_qual_requirements!=" ",
         full_time_part_time_indicator == "F", salary_frequency == "Annual")


x = c("baccalaureate", "Bachelor")
y = c("Master", "master")

master_data = filter(job_data, grepl(paste(y, collapse = "|"), minimum_qual_requirements),
                     !grepl(paste(x, collapse = "|"), minimum_qual_requirements))
  

baccalaureate_data = filter(job_data, grepl(paste(x, collapse = "|"), minimum_qual_requirements),
                            !grepl(paste(y, collapse = "|"), minimum_qual_requirements))

Other_data = filter(job_data, !grepl(paste(y, collapse = "|"), minimum_qual_requirements),
                    !grepl(paste(x, collapse = "|"), minimum_qual_requirements))

Boxplots of salary for different educational requirement.

plot_ly(master_data, y = ~salary_range_from, color = ~job_category, type = "box", colors = "Set2") %>%

layout(title = "Base salary of jobs required at least master degree")

plot_ly(baccalaureate_data, y = ~salary_range_from, color = ~job_category, type = "box", colors = "Set2") %>%
  layout(title = "Base salary of jobs required baccalaureate degree(No need master)")
  
plot_ly(Other_data, y = ~salary_range_from, color = ~job_category, type = "box",
          colors = "Set2") %>%
  layout(title = "Base salary of different kind of jobs without requirement of degree")

For those three plots, we see that in the boxplot of base salary of jobs required for at least master degree, engineering has the highest median base salary and the overall base salary of Legal Affairs are pretty high compared with other job category. While in the boxplot of base salary of jobs required for baccalaureate degree, technology has the highest median base salary, and the overall salary of Administration is pretty high compared with others. In the boxplot of base salary without requirement of degree, the median base salary and the overall base salary of Engineering are all pretty high compared with other job categories.

Comparing those three plots, we can see the overall base salary of jobs required for at least master degree are higher than the overall base salary of jobs required for baccalaureate. The distribution of the base salary of jobs without degrees’ requirement is not centralized and the base salaries mainly depend on the job positon in this case.

The wage increasing ranges of different jobs

job_data = mutate(job_data, salary_range = salary_range_to - salary_range_from)

plot_ly(job_data, y = ~salary_range, x = ~job_category, type = "bar") %>%
  layout(title = "The wage increasing ranges of different kinds of jobs")

From the plot, we can see that the job category of Legal Affairs has the largest wage increasing range and the job category of Clerical has the smallest wage increasing range compared with other different job categories.

Table of positions, agency and job category.

job_positions = nyc_jobs %>%
  select(x_of_positions, agency, job_category, salary_mean) %>%
  distinct() %>%
  group_by(agency, job_category) %>%
  summarise(positions = sum(x_of_positions)) %>%
  arrange(desc(positions))

# Number of job positions: top 10
knitr::kable(head(job_positions, 10))
agency job_category positions
DEPT OF HEALTH/MENTAL HYGIENE Health 382
ADMIN FOR CHILDREN’S SVCS Social service 100
DEPT OF ENVIRONMENT PROTECTION Engineering 82
DEPARTMENT OF BUILDINGS Public safety 71
DEPARTMENT OF PROBATION Public safety 65
DEPARTMENT OF TRANSPORTATION Engineering 65
DEPARTMENT OF TRANSPORTATION Maintance 60
LAW DEPARTMENT Legal Affairs 56
NYC HOUSING AUTHORITY Maintance 56
DEPT OF HEALTH/MENTAL HYGIENE Community 55

The table above shows the top-10 number of job positions in NYC from 2011 to 2017. DEPT OF HEALTH/MENTAL HYGIENE with job category “Health” has the most job positions.

Bar plot of job category vs. number of positions of each year.

positions_plot = nyc_jobs %>%
  select(x_of_positions, agency, job_category, salary_mean, posting_date) %>%
  distinct() %>%
  separate(posting_date, into = c("year", "month", "day"), sep = "-") %>%
  select(-month, -day) %>%
  group_by(job_category, year) %>%
  summarise(positions = sum(x_of_positions))

plot_ly(positions_plot, x = ~job_category, y = ~positions,
        color = ~year, type = "bar") %>%
  layout(title = "Number of positions of job categories in each year",
         barmode = "group")

The bar chart above shows the number of positions for each job category in each year. From 2013 to 2017, the number of job positions in each category keeps increasing. And all categories have a dramatically increase in job positions in 2017. This might because more and more companies were founded and developed in 2017, thus they need more employees joining in. Besides, since this dataset contains all job information from NYC official job website, as the year increases, more people found this website and created job postings on this site.

Wordcount analysis of Preffered Skills and Minimum Qual Requirements

nyc_jobs = nyc_jobs%>% 
  ungroup()%>%
  mutate(  minimum_qual_requirements = as.character(minimum_qual_requirements))%>%
  mutate(preferred_skills = as.character(preferred_skills))

jobs_words_skill =  nyc_jobs%>%
  unnest_tokens(word,preferred_skills)%>%
  anti_join(stop_words)%>%
  inner_join(., parts_of_speech) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## Joining, by = "word"
jobs_words_requirement =  nyc_jobs%>%
  unnest_tokens(word,minimum_qual_requirements)%>%
  anti_join(stop_words)%>%
  inner_join(., parts_of_speech) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## Joining, by = "word"

jobs_words_skill %>% 
  top_n(20) %>% 
  mutate(word = fct_reorder(word, n)) %>% 
  plot_ly(y = ~word, x = ~n, color = ~word, type = "bar")%>%
  layout(title = "Preferred Skills Word Counts")
## Selecting by n
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

jobs_words_requirement %>% 
  top_n(20) %>% 
  mutate(word = fct_reorder(word, n)) %>% 
  plot_ly(y = ~word, x = ~n, color = ~word, type = "bar")%>%
  layout(title = "Minimum Qual Requirements")
## Selecting by n
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

The word count plots of prefferred skills and minimum qual requirements. The word count plots of prefferred skills and minimum qual requirements. After deleting meaningless words, experience is the most used word both in prefeered skills and minimum requirement. We could see that word and excel is still the most common skill needed, and other skills for communication like written, verbal public speaking and oral is also like basic skills needed. For miminum qual requirements, education is really important, for that there are several words related to education shown on the plot such like “school”, “college”, “degree”, “education”, “graduate”. Those words indicates a high requirement of education like bacherlor or master degree. And the word york indicates the requirement of residency in New York. For different knids of job, the skill and minimum requirement may change. The result of this part will be shown on the Shinny Website. The word clouds of skills and minimum requirement are shown below, which indicates a more straight view of word frequency.

Word Cloud of Preffered Skills and Minimum Qual Requirements

set.seed(123)
wordcloud2(jobs_words_skill, size = 2,color = 'random-light',  
           backgroundColor = "gray", fontWeight='bold',  
           minRotation = -pi/3, maxRotation = pi/3,rotateRatio = 0.8)  
## Warning in if (class(data) == "table") {: 条件的长度大于一,因此只能用其第
## 一元素
wordcloud2(jobs_words_requirement, size = 2,color = 'random-light',  
           backgroundColor = "gray", fontWeight='bold',  
           minRotation = -pi/3, maxRotation = pi/3,rotateRatio = 0.8)  
## Warning in if (class(data) == "table") {: 条件的长度大于一,因此只能用其第
## 一元素

Additional analysis:

Linear regression analysis

Discussion:

There are more job positions located at prosperous places and downtown. It makes sense that companies and departments always located in downtown.
We found that for base salary of jobs required for at least master degree, engineering has the highest median base salary, for base salary of jobs required for baccalaureate degree, technology has the highest median base salary, and for base salary without requirement of degree, the median base salary and the overall base salary of Engineering are all pretty high compared with other job categories.
From 2013 to 2017, the number of job positions in each category keeps increasing. And in 2017 all categories have a dramatically increase in job positions. This might because there are more positions needed in government during recent year. Engineering positions has greatest increasing between 2016 to 2017,and job in health has great increasing between 2015 to 2016. We may predict that there will be great needed for talents in health and engineering area. While jobs in maintance has decrease between 2013 to 2014. Maybe this is because of the development of technology, fewer manpower is needed in maintance.
“Experience” is the top popular word in both preferred skills and requirements. It is important for people who want a job to get prepared. We may do some internship or related project to make us more competitive. Other top words in preferred skills are “written”,“project” and “team” . Employers always preferred candidates with strong skills especially written skills and team work.In requirements, “school” appears frequently. We may infer that education background is also important.
Linear